join changes: broadcast left/right_on expressions + omit left_on/right_on expressions in result #14007
Hi @ritchie46, I'm seeking early feedback on this PR
At first I was investigating #9603. The problem there is that literals (or other single-value expressions) are not broadcast properly.
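As a rough illustration of what "broadcast" means here, the sketch below repeats a single-value key so that it lines up with every row of the frame. It is a plain-Python toy, not Polars code: `broadcast_key` is a hypothetical helper, and keys are modelled as simple lists.

```python
def broadcast_key(values, height):
    """Repeat a single-value key so it has one entry per row of the frame."""
    if len(values) == height:
        # Already aligned with the frame: nothing to do.
        return list(values)
    if len(values) == 1:
        # A literal (or other single-value expression): repeat it for every row.
        return list(values) * height
    raise ValueError(
        f"key of length {len(values)} cannot be broadcast to height {height}"
    )


# A literal join key evaluates to one value, but the frame has three rows.
literal_key = [1]
print(broadcast_key(literal_key, 3))  # [1, 1, 1]
```

Without a step like this, a length-1 key cannot be matched row by row against the other side of the join.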
While reading the code I noticed some other issues. We have disallowed aliases in join expressions (#6312); however, this still has flaws when there are duplicate names, e.g.
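A minimal plain-Python sketch of this kind of overwrite, assuming the key expressions are evaluated and then stacked onto the frame by output name. The `hstack_keys` helper, the dict-of-lists frame, and the two key expressions are hypothetical stand-ins, not Polars internals:

```python
def hstack_keys(frame, key_columns):
    """Toy model: append evaluated key expressions to a frame by output name."""
    out = dict(frame)
    for name, values in key_columns:
        # A later key with the same output name replaces the earlier one.
        out[name] = values
    return out


frame = {"a": [1, 2, 3]}
# Two hypothetical left_on expressions that both keep the output name "a",
# e.g. the column itself and the column multiplied by 2.
keys = [("a", [1, 2, 3]), ("a", [2, 4, 6])]
stacked = hstack_keys(frame, keys)
print(stacked)  # {'a': [2, 4, 6]}
```

Because aliases are disallowed, the two expressions cannot be given distinct names, so only the last one survives under that name.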
The column `a` is overridden with the last expression in left_on. I later saw similar reports in #8874 and #13220.
My idea for the latter problem is to not hstack the left_on/right_on expressions onto the dataframes before the join (as currently done). I assume these series do not need to be kept in the result, so they are dropped when the join has finished. If we do want to keep them, we could either error when there are duplicate column names, or add suffixes to the columns, e.g. `a_left1`, `a_left2` in the example above.
Assuming we don't keep those series, the next problem is knowing which ones to drop. Example
Currently we get the following result. Note that "c" is dropped; it seems to me that you would want to keep it in this situation.
A sensible new condition for dropping columns in the right df could be: if the left_on/right_on expressions were already columns in the original dfs. I have added an implementation to do this for left joins in the PR so far.
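The proposed condition can be sketched in plain Python as follows. This is a toy model under stated assumptions, not the implementation in the PR: `Key` and `right_columns_to_drop` are hypothetical names, and each join expression is reduced to its output name plus a flag saying whether it is a bare column reference.

```python
from typing import NamedTuple


class Key(NamedTuple):
    name: str        # output name of the join expression
    is_column: bool  # True for a bare column reference, False if computed


def right_columns_to_drop(left_on, right_on, left_columns, right_columns):
    """Drop a right key only when both keys of the pair are existing columns."""
    drop = []
    for left, right in zip(left_on, right_on):
        left_exists = left.is_column and left.name in left_columns
        right_exists = right.is_column and right.name in right_columns
        if left_exists and right_exists:
            drop.append(right.name)
    return drop


left_cols, right_cols = {"a", "b"}, {"c", "d"}

# Both keys are existing columns, so the right key "c" is dropped.
print(right_columns_to_drop([Key("a", True)], [Key("c", True)], left_cols, right_cols))  # ['c']

# The left key is a computed expression (hypothetical), so "c" is kept.
print(right_columns_to_drop([Key("a", False)], [Key("c", True)], left_cols, right_cols))  # []
```

The design choice in this sketch is to decide per key pair, so a join that mixes plain columns and computed expressions only drops the right columns whose pair is made of plain columns.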
Example
We dropped `c` here because `a` and `c` are not calculated expressions and they already exist in the left/right dataframes.
I hope these examples are not too confusing 😄
One final thing: if we are not including the left_on/right_on series in the result, do we still need to disallow aliases?
Thanks!